Owing to the limited computing power and storage resources of hardware platforms, implementing energy-efficient and computationally efficient convolutional neural networks (CNNs) on embedded systems remains a primary challenge for hardware designers. In this context, a complete design of a heterogeneous embedded system implemented on a system-on-chip (SoC) with a field-programmable gate array (FPGA) is proposed. The design adopts a cascaded input multiplexing structure that enables two independent multiply-accumulate operations in a single DSP, reducing external memory accesses, improving system efficiency, and lowering power consumption. Compared with other designs, power efficiency is improved by over 38.7%. The framework is successfully deployed for a large-scale CNN on low-cost devices, significantly improving the power efficiency of the network model; on the ZYNQ XC7Z045 device, the power efficiency reaches 102 Gops/W. Furthermore, when inferring the convolutional (CONV) layers of VGG-16 with this framework, a frame rate of up to 10.9 fps is achieved, demonstrating effective acceleration of CNN inference in power-constrained environments.
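The abstract does not detail how the cascaded input multiplexing structure maps two multiply-accumulates onto one DSP; the Python sketch below only illustrates the general operand-packing idea behind such dual-MAC-per-DSP schemes (two weights sharing one activation are packed into a single wide multiplication, and the two products are recovered from the wide result). The names SHIFT, pack_weights, and dual_mac, the 18-bit packing offset, and the shared-activation assumption are illustrative assumptions, not the authors' exact scheme.

```python
# Illustrative sketch (assumed scheme): two signed 8-bit weights packed into one
# wide multiplier operand so a single wide multiply yields two independent products.
SHIFT = 18  # hypothetical packing offset; must exceed the bit width of each partial product

def pack_weights(w_hi: int, w_lo: int) -> int:
    """Pack two signed 8-bit weights into one wide operand: w_hi * 2**SHIFT + w_lo."""
    return (w_hi << SHIFT) + w_lo

def dual_mac(a: int, w_hi: int, w_lo: int) -> tuple[int, int]:
    """Emulate one wide multiplication that produces a*w_hi and a*w_lo at once."""
    wide = a * pack_weights(w_hi, w_lo)   # single wide multiply (one DSP in hardware)
    p_lo = wide & ((1 << SHIFT) - 1)      # low slice holds a*w_lo in two's complement
    if p_lo >= (1 << (SHIFT - 1)):        # sign-extend the low slice
        p_lo -= (1 << SHIFT)
    p_hi = (wide - p_lo) >> SHIFT         # remaining high part holds a*w_hi
    return p_hi, p_lo

if __name__ == "__main__":
    # Both products recovered from one multiplication: (-15, 35) == (5*-3, 5*7)
    print(dual_mac(5, -3, 7))
```

In this sketch the 18-bit offset leaves enough headroom for an 8-bit-by-8-bit partial product, so the two results never overlap; a hardware implementation would additionally accumulate the separated products in the DSP's post-adder.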